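Biclustering algorithms partition data and covariates simultaneously, providing new insights in several domains, such as analyzing gene expression to discover new biological functions. This paper develops a new model-free biclustering algorithm in abstract spaces using the notions of energy distance (ED) and maximum mean discrepancy (MMD), two distances between probability distributions that can handle complex data such as curves or graphs. The proposed method can learn more general and complex cluster shapes than most existing methods in the literature, which usually focus on detecting differences in mean and variance. Although the biclustering configurations of our approach are constrained to create disjoint structures at the datum and covariate levels, the results are competitive. Our results are similar to those of state-of-the-art methods in their optimal scenarios, assuming a proper kernel choice, and outperform them when cluster differences are concentrated in higher-order moments. The model's performance has been tested in several situations involving simulated and real-world datasets. Finally, new theoretical consistency results are established using some tools from optimal transport theory.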
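To make the two distances named above concrete, the following is a minimal sketch of their standard plug-in (V-statistic) estimators for one-dimensional samples. It is not the paper's biclustering algorithm; the function names, the Gaussian kernel, and the `bandwidth` default are assumptions chosen for illustration.

```python
import math

def mean_pairwise(xs, ys, func):
    """Average of func(x, y) over all pairs (x, y) from the two samples."""
    return sum(func(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def energy_distance(xs, ys):
    """Plug-in energy distance: 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    dist = lambda a, b: abs(a - b)
    return (2.0 * mean_pairwise(xs, ys, dist)
            - mean_pairwise(xs, xs, dist)
            - mean_pairwise(ys, ys, dist))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased plug-in estimate of squared MMD with a Gaussian kernel.

    The kernel and its bandwidth are illustrative assumptions.
    """
    kernel = lambda a, b: math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))
    return (mean_pairwise(xs, xs, kernel)
            + mean_pairwise(ys, ys, kernel)
            - 2.0 * mean_pairwise(xs, ys, kernel))

# Two samples with the same mean but different spread.
narrow = [-1.0, 0.0, 1.0]
wide = [-3.0, 0.0, 3.0]
print(energy_distance(narrow, wide), mmd_squared(narrow, wide))
```

Both quantities are zero when the two samples coincide and positive here even though the sample means agree, which is the kind of beyond-the-mean difference the abstract refers to.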
translated by 谷歌翻译
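We study the problem of variance estimation in general graph-structured problems. First, we develop a linear-time estimator for the homoscedastic case that consistently estimates the variance in general graphs. We show that our estimator attains minimax rates for chain and 2D grid graphs when the mean signal has total variation with canonical scaling. Furthermore, we provide general upper bounds on the mean squared error of the fused lasso estimator in general graphs under a moment condition and a bound on the tail behavior of the errors. These upper bounds allow us to generalize, to broader classes of distributions such as sub-exponential ones, many existing results on the fused lasso that are only known to hold under the assumption that the errors are sub-Gaussian random variables. Exploiting our upper bounds, we then study a simple total variation regularized estimator for estimating the signal of variances in the heteroscedastic case. Our results show that the variance estimator attains minimax rates for estimating signals of bounded variation in grid graphs and in $K$-nearest neighbor graphs under very mild assumptions, and that it is consistent for estimating the variances in any connected graph. In addition, extensive numerical results show that our proposed estimators perform well in a variety of graph-structured models.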
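In this paper, we consider a new variant of principal component analysis (PCA) that aims to capture the grouping and/or sparse structures of factor loadings simultaneously. To achieve these goals, we employ a non-convex truncated regularization with naturally adjustable sparsity and grouping effects, and propose Feature Grouping and Sparse Principal Component Analysis (FGSPCA). The proposed FGSPCA method encourages factor loadings with similar values to collapse into disjoint homogeneous groups for feature grouping, or into a special zero-valued group for feature selection, which helps reduce model complexity and improve model interpretability. Existing structured PCA methods usually require prior knowledge to construct the regularization term. The proposed FGSPCA, however, can capture the grouping and/or sparse structures of factor loadings simultaneously without any prior information. To solve the resulting non-convex optimization problem, we propose an alternating algorithm that combines convex programming, the augmented Lagrange method, and coordinate descent. Experimental results demonstrate the promising performance and efficiency of the new method on both synthetic and real-world datasets. An R implementation of FGSPCA is available on GitHub at https://github.com/higeeks/fgspca.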
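The current paper studies the problem of minimizing a loss $f(\boldsymbol{x})$ subject to constraints of the form $\boldsymbol{D}\boldsymbol{x} \in S$, where $S$ is a closed set, convex or not, and $\boldsymbol{D}$ is a matrix that fuses parameters. Fusion constraints can capture smoothness, sparsity, or more general constraint patterns. To tackle this generic class of problems, we combine the Beltrami-Courant penalty method with the proximal distance principle. The latter is driven by minimization of penalized objectives $f(\boldsymbol{x}) + \frac{\rho}{2}\text{dist}(\boldsymbol{D}\boldsymbol{x}, S)^2$ involving large tuning constants $\rho$ and the squared Euclidean distance of $\boldsymbol{D}\boldsymbol{x}$ from $S$. The next iterate $\boldsymbol{x}_{n+1}$ of the corresponding proximal distance algorithm is constructed from the current iterate $\boldsymbol{x}_n$ by minimizing the majorizing surrogate function $f(\boldsymbol{x}) + \frac{\rho}{2}\|\boldsymbol{D}\boldsymbol{x} - \mathcal{P}_{S}(\boldsymbol{D}\boldsymbol{x}_n)\|^2$. For fixed $\rho$, a subanalytic loss $f(\boldsymbol{x})$, and a subanalytic constraint set $S$, we prove convergence to a stationary point. Under stronger assumptions, we provide convergence rates and demonstrate linear local convergence. We also construct a steepest descent (SD) variant to avoid costly linear system solves. To benchmark our algorithms, we compare them against the alternating direction method of multipliers (ADMM). Our extensive numerical tests include problems on metric projection, convex regression, convex clustering, total variation image denoising, and projection of a matrix to a good condition number. These experiments demonstrate the superior speed and acceptable accuracy of our steepest descent variant on high-dimensional problems.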
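The surrogate update described above can be sketched on a toy problem. The example assumes a quadratic loss $f(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{x} - \boldsymbol{y}\|^2$ in two variables, a one-row fusion matrix $\boldsymbol{D} = (1, -1)$, and the interval $S = [-c, c]$, so the constraint is $|x_1 - x_2| \le c$. These choices, the function names, and the default `rho` are illustrative assumptions, not the paper's experiments.

```python
def clip(value, low, high):
    """Project a scalar onto the interval [low, high]."""
    return max(low, min(high, value))

def proximal_distance(y, c, rho=1000.0, iterations=50):
    """Approximately minimize 0.5*||x - y||^2 subject to |x[0] - x[1]| <= c.

    Each iteration minimizes the surrogate
        0.5*||x - y||^2 + (rho/2) * (d.x - P_S(d.x_n))^2
    with d = (1, -1), which has a closed-form minimizer.
    """
    d = (1.0, -1.0)      # the single row of the fusion matrix D
    x = list(y)          # start from the unconstrained minimizer
    for _ in range(iterations):
        p = clip(x[0] - x[1], -c, c)                 # P_S(D x_n)
        b = [y[i] + rho * p * d[i] for i in range(2)]
        # Solve (I + rho * d d^T) x = b with the Sherman-Morrison
        # formula; here d^T d = 2.
        s = rho * (d[0] * b[0] + d[1] * b[1]) / (1.0 + 2.0 * rho)
        x = [b[i] - s * d[i] for i in range(2)]
    return x

# The exact constrained minimizer for y = (3, 0), c = 1 is (2, 1).
print(proximal_distance([3.0, 0.0], c=1.0))
```

With a fixed, finite `rho` the iterates settle at the minimizer of the penalized objective, which lies slightly outside the constraint set; increasing `rho` moves it closer to the exact constrained solution.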
In optimization-based approaches to inverse problems and to statistical estimation, it is common to augment the objective with a regularizer to address challenges associated with ill-posedness. The choice of a suitable regularizer is typically driven by prior domain information and computational considerations. Convex regularizers are attractive as they are endowed with certificates of optimality as well as the toolkit of convex analysis, but exhibit a computational scaling that makes them ill-suited beyond moderate-sized problem instances. On the other hand, nonconvex regularizers can often be deployed at scale, but do not enjoy the certification properties associated with convex regularizers. In this paper, we seek a systematic understanding of the power and the limitations of convex regularization by investigating the following questions: Given a distribution, what are the optimal regularizers, both convex and nonconvex, for data drawn from the distribution? What properties of a data source govern whether it is amenable to convex regularization? We address these questions for the class of continuous and positively homogeneous regularizers for which convex and nonconvex regularizers correspond, respectively, to convex bodies and star bodies. By leveraging dual Brunn-Minkowski theory, we show that a radial function derived from a data distribution is the key quantity for identifying optimal regularizers and for assessing the amenability of a data source to convex regularization. Using tools such as $\Gamma$-convergence, we show that our results are robust in the sense that the optimal regularizers for a sample drawn from a distribution converge to their population counterparts as the sample size grows large. Finally, we give generalization guarantees that recover previous results for polyhedral regularizers (i.e., dictionary learning) and lead to new ones for semidefinite regularizers.
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.
System identification, also known as learning forward models, transfer functions, system dynamics, etc., has a long tradition in both science and engineering across different fields. In particular, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping from the current state and action to the next state. This problem is commonly framed directly as a Supervised Learning problem. This common approach faces several difficulties due to the inherent complexities of the dynamics to be learned, for example, delayed effects, high non-linearity, non-stationarity, partial observability and, more importantly, error accumulation when using bootstrapped predictions (predictions based on past predictions) over long time horizons. Here we explore the use of Reinforcement Learning for this problem. We elaborate on why and how this problem fits naturally and soundly as a Reinforcement Learning problem, and present some experimental results that demonstrate that RL is a promising technique for solving this kind of problem.
Background: Encouraged by the success of pretrained Transformer models in many natural language processing tasks, their use for International Classification of Diseases (ICD) coding tasks is now actively being explored. In this study, we investigate three types of Transformer-based models, aiming to address the extreme label set and long text classification challenges that are posed by automated ICD coding tasks. Methods: The Transformer-based model PLM-ICD achieved the current state-of-the-art (SOTA) performance on the ICD coding benchmark dataset MIMIC-III. It was chosen as our baseline model to be further optimised. XR-Transformer, the new SOTA model in the general extreme multi-label text classification domain, and XR-LAT, a novel adaptation of the XR-Transformer model, were also trained on the MIMIC-III dataset. XR-LAT is a recursively trained model chain on a predefined hierarchical code tree with label-wise attention, knowledge transferring and dynamic negative sampling mechanisms. Results: Our optimised PLM-ICD model, which was trained with longer total and chunk sequence lengths, significantly outperformed the current SOTA PLM-ICD model, and achieved the highest micro-F1 score of 60.8%. The XR-Transformer model, although SOTA in the general domain, did not perform well across all metrics. The best XR-LAT based model obtained results that were competitive with the current SOTA PLM-ICD model, including improving the macro-AUC by 2.1%. Conclusion: Our optimised PLM-ICD model is the new SOTA model for automated ICD coding on the MIMIC-III dataset, while our novel XR-LAT model performs competitively with the previous SOTA PLM-ICD model.
Sensor-based remote health monitoring is used in industrial, urban and healthcare settings to monitor ongoing operation of equipment and human health. An important aim is to intervene early if anomalous events or adverse health is detected. In the wild, these anomaly detection approaches are challenged by noise, label scarcity, high dimensionality, explainability and wide variability in operating environments. The Contextual Matrix Profile (CMP) is a configurable 2-dimensional version of the Matrix Profile (MP) that uses the distance matrix of all subsequences of a time series to discover patterns and anomalies. The CMP is shown to enhance the effectiveness of the MP and other SOTA methods at detecting, visualising and interpreting true anomalies in noisy real-world data from different domains. It excels at zooming out and identifying temporal patterns at configurable time scales. However, the CMP does not address cross-sensor information, and cannot scale to high-dimensional data. We propose a novel, self-supervised graph-based approach for temporal anomaly detection that works on context graphs generated from the CMP distance matrix. The learned graph embeddings encode the anomalous nature of a time context. In addition, we evaluate other graph outlier algorithms for the same task. Because our pipeline is modular, graph construction, generation of graph embeddings, and pattern recognition logic can all be chosen based on the specific pattern detection application. We verified the effectiveness of graph-based anomaly detection and compared it with the CMP and 3 state-of-the-art methods on two real-world healthcare datasets with different anomalies. Our proposed method demonstrated better recall, alert rate and generalisability.
The physics-informed neural operator (PINO) is a machine learning architecture that has shown promising empirical results for learning partial differential equations. PINO uses the Fourier neural operator (FNO) architecture to overcome the optimization challenges often faced by physics-informed neural networks. Since the convolution operator in PINO uses the Fourier series representation, its gradient can be computed exactly in Fourier space. While Fourier series cannot represent nonperiodic functions, PINO and FNO still have the expressivity to learn nonperiodic problems with Fourier extension via padding. However, computing the Fourier extension in the physics-informed optimization requires solving an ill-conditioned system, resulting in inaccurate derivatives which prevent effective optimization. In this work, we present an architecture that leverages Fourier continuation (FC) to apply the exact gradient method to PINO for nonperiodic problems. This paper investigates three different ways that FC can be incorporated into PINO by testing their performance on a 1D blowup problem. Experiments show that FC-PINO outperforms padded PINO, improving equation loss by several orders of magnitude, and that it can accurately capture the third-order derivatives of nonsmooth solution functions.
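The remark that gradients can be computed exactly in Fourier space can be illustrated on a periodic function: differentiation multiplies the $k$-th Fourier coefficient by $ik$. The sketch below uses a naive discrete Fourier transform from the standard library; it is not the PINO or FC-PINO implementation, and all function names are hypothetical.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(values[j] * cmath.exp(sign * 2j * math.pi * j * k / n)
                    for j in range(n))
        out.append(total / n if inverse else total)
    return out

def spectral_derivative(samples, period=2.0 * math.pi):
    """Differentiate equally spaced samples of a periodic function."""
    n = len(samples)
    coeffs = dft(samples)
    scaled = []
    for k in range(n):
        freq = k if k < n / 2 else k - n      # signed wavenumber
        if n % 2 == 0 and k == n // 2:
            freq = 0                          # drop the Nyquist mode
        scaled.append(coeffs[k] * 1j * freq * 2.0 * math.pi / period)
    return [z.real for z in dft(scaled, inverse=True)]

n = 32
xs = [2.0 * math.pi * j / n for j in range(n)]
derivative = spectral_derivative([math.sin(x) for x in xs])
error = max(abs(d - math.cos(x)) for d, x in zip(derivative, xs))
print(error)  # close to machine precision for this periodic input
```

The same recipe loses accuracy when the sampled function is not periodic on the interval, which is the difficulty that padding and Fourier continuation are meant to address.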